Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add cANI estimation to branchwater #188

Closed
wants to merge 5 commits into from
Closed

[WIP] Add cANI estimation to branchwater #188

wants to merge 5 commits into from

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Jan 19, 2024

resolves #164

@mr-eyes mr-eyes changed the title Add cANI estimation to branchwater [WIP] Add cANI estimation to branchwater Jan 19, 2024
@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 19, 2024

@ctb @bluegenes
I have two questions for this PR.

  1. I was wondering if it's better to include ANI calculation in multisearch, multigather, and pairwise in a single PR? or sub-PRs from this?
  2. Should the ANI calculation be optional via a command line flag, or it can be a default column in the output files?

@bluegenes, Please let me know if you find any issues in the ANI code, so I can help fix them if needed.

@ctb
Copy link
Collaborator

ctb commented Jan 19, 2024

2. Should the ANI calculation be optional via a command line flag, or it can be a default column in the output files?

As long as it doesn't degrade performance, I think adding it as a default column is good!

@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 23, 2024

Reading list of query paths from: 'sig_paths.l'
Loaded 532 query signature(s)
Processed 0 comparisons
Processed 100000 comparisons
DONE. Processed 141246 comparisons

pairwise command before adding ANI

	Command being timed: "sourmash scripts pairwise sig_paths.l -c 16 -k 51 -s 10000 -o pairwise_result.csv -t 0.0"
	User time (seconds): 198.93
	System time (seconds): 4.22
	Percent of CPU this job got: 1471%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.80
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1781012
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 623829
	Voluntary context switches: 142733
	Involuntary context switches: 150950
	Swaps: 0
	File system inputs: 0
	File system outputs: 50232
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

pairwise command after adding ANI

	Command being timed: "sourmash scripts pairwise sig_paths.l -c 16 -k 51 -s 10000 -o ani_pairwise_result.csv -t 0.0"
	User time (seconds): 286.95
	System time (seconds): 5.45
	Percent of CPU this job got: 1084%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.96
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1859196
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 3
	Minor (reclaiming a frame) page faults: 598861
	Voluntary context switches: 137504
	Involuntary context switches: 461133
	Swaps: 0
	File system inputs: 0
	File system outputs: 55432
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 23, 2024

@ctb
Copy link
Collaborator

ctb commented Jan 24, 2024

is that really a 2x hit in time!?

@ctb
Copy link
Collaborator

ctb commented Jan 24, 2024

@bluegenes Am I correctly calculating ANI from containment here? https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/188/files#diff-d62689bb23b648c8b52cd5f21c1bf1cf3974ef38477c5ca006c9b1741d380aa8R74-R107

This is an excellent opportunity for a test: calculate containment with regular sourmash code, then calculate it with your code, then compare the numbers...

@mr-eyes
Copy link
Member Author

mr-eyes commented Jan 24, 2024

is that really a 2x hit in time!?

Yes :/

This is an excellent opportunity for a test: calculate containment with regular sourmash code, then calculate it with your code, then compare the numbers...

I can load the comparisons CSV output and use the containment values (max, min) to calculate avg ani with sourmash then validate with the new rust ANI column. Is that a valid test?

@ctb
Copy link
Collaborator

ctb commented Jan 25, 2024

This is an excellent opportunity for a test: calculate containment with regular sourmash code, then calculate it with your code, then compare the numbers...

I can load the comparisons CSV output and use the containment values (max, min) to calculate avg ani with sourmash then validate with the new rust ANI column. Is that a valid test?

Pretty much anything can be a valid test, but it sounds more roundabout than (a) running a comparison with sourmash, (b) saving the result, (c) writing a test that runs a comparison with branchwater and checks that you get approximately the same value? Hardcoding the known-good answer into a test is fine.

@bluegenes
Copy link
Contributor

bluegenes commented Feb 14, 2024

hi @mr-eyes --

I'm moving ANI calculation into rust in sourmash-bio/sourmash#2943.

Would you like to submit a PR with the equations file into that PR? I have some modifications I'll make, but I can do it over there.

If not, lmk and I'll just add the code directly

cc @ctb

@mr-eyes
Copy link
Member Author

mr-eyes commented Feb 14, 2024

hi @mr-eyes --

I'm moving ANI calculation into rust in sourmash-bio/sourmash#2943.

Would you like to submit a PR with the equations file into that PR? I have some modifications I'll make, but I can do it over there.

If not, lmk and I'll just add the code directly

cc @ctb

I don't mind copying or modifying the code. Although I would love to contribute to the amazing PR, I am not currently able to commit. Please go for it.

@ctb
Copy link
Collaborator

ctb commented Feb 14, 2024

I don't mind copying or modifying the code. Although I would love to contribute to the amazing PR, I am not currently able to commit. Please go for it.

you just run git commit man. come on.

😁

@mr-eyes
Copy link
Member Author

mr-eyes commented Feb 14, 2024

I don't mind copying or modifying the code. Although I would love to contribute to the amazing PR, I am not currently able to commit. Please go for it.

you just run git commit man. come on.

😁

Hahaha, ok as long as my committed code will not be judged

@mr-eyes
Copy link
Member Author

mr-eyes commented Feb 14, 2024

@bluegenes @ctb
I thought I would do the full integration by opening a PR. That sourmash-bio/sourmash#3005 was fast :) Thank you!

@mr-eyes
Copy link
Member Author

mr-eyes commented Feb 24, 2024

Closing in favor of #236

@mr-eyes mr-eyes closed this Feb 24, 2024
bluegenes added a commit that referenced this pull request Feb 26, 2024
Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188) which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943

Split from #234 to make things more concise/simpler.

## benchmark summary

## 12k ICTV viral genomes, scaled=200

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 21s     | 
| PR | with ANI     | 20s     |
| v0.9.0 | no ANI | 21s |


## 12k ICTV viral genomes, scaled=10

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 14m 0s     | 
| PR | with ANI     | 14m 6s     | 
| main branch (~v0.9.0) | no ANI | 14m 47s |
@ctb ctb deleted the ani branch June 11, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calculate ANI based on a CLI flag
3 participants